Abstract — In this paper, we consider the problem of deploying
a robot from a specification given as a temporal logic
statement about some properties satisfied by the regions of 
a large, partitioned environment. We assume that the robot 
has noisy sensors and actuators and model its motion through 
the regions of the environment as a Markov Decision Process 
(MDP). The robot control problem becomes finding the control 
policy maximizing the probability of satisfying the temporal
logic task on the MDP. For a large environment, obtaining
transition probabilities for each state-action pair, as well as
solving the necessary optimization problem for the optimal 
policy are usually not computationally feasible. To address 
these issues, we propose an approximate dynamic programming 
framework based on a least-square temporal difference learning 
method of the actor-critic type. This framework operates on 
sample paths of the robot and optimizes a randomized control 
policy with respect to a small set of parameters. The transition 
probabilities are obtained only when needed. Hardware-in-the- 
loop simulations confirm that convergence of the parameters 
translates to an approximately optimal policy. 

Index Terms — Motion planning, Markov Decision Processes, 
dynamic programming, actor-critic methods. 

I. Introduction 

One major goal in robot motion planning and control is 
to specify a mission task in an expressive and high-level 
language and convert the task automatically to a control 
strategy for the robot. The robot is subject to mechanical 
constraints, actuation and measurement noise, and limited 
communication and sensing capabilities. The challenge in 
this area is the development of a computationally efficient 
framework accommodating both the robot constraints and 
the uncertainty of the environment, while allowing for a large 
spectrum of task specifications. 

In recent years, temporal logics such as Linear Temporal 
Logic (LTL) and Computation Tree Logic (CTL) have been 
promoted as formal task specification languages for robotic 
applications [1]-[6]. They are appealing due to their high
expressivity and closeness to human language. Moreover, 
several existing formal verification [7], [8] and synthesis [8] 
tools can be adapted to generate motion plans and provably 
correct control strategies for the robots.

In this paper, we assume that the robot model in the environment
is described by a (finite) Markov Decision Process
(MDP). In this model, the robot can precisely determine its 
current state, and by applying an action (corresponding to a 
motion primitive) enabled at each state, it triggers a transition 
to an adjacent state with a fixed probability. We are interested 
in controlling the MDP robot model such that it maximizes 
the probability of satisfying a temporal logic formula over 
a set of properties satisfied at the states of the MDP. By 
adapting existing probabilistic model checking [8]-[10] and
synthesis [11], [12] algorithms, we recently developed such
computational frameworks for formulas of LTL [13] and a
fragment of probabilistic CTL [14].

With the above approaches, an optimal control policy can 
be generated to maximize the satisfying probability, given 
that the transition probabilities are known for each state- 
action pair of the MDP, which can be computed by using 
a Monte-Carlo method and repeated forward simulations. 
However, it is often not feasible for realistic robotic appli- 
cations to obtain the transition probabilities for each state- 
action pair, even if an accurate model or a simulator of 
the robot in the environment is available. Moreover, the 
problem size is even larger when considering temporal logic 
specifications. For example, in order to find an optimal policy
for an MDP satisfying an LTL formula, one needs to solve
a dynamic programming problem on the product between
the original MDP and a Rabin automaton representing the
formula. As such, an exact solution can be computationally
prohibitive for realistic settings.

In this paper, we show that approximate dynamic pro- 
gramming [15] can be effectively used to address the above 
limitations. For large dynamic programming problems, an 
approximately optimal solution can be provided using actor- 
critic algorithms [16]. In particular, actor-critic algorithms 
with Least Squares Temporal Difference (LSTD) learning 
have been shown recently to be a powerful tool to solve 
large-sized problems [17], [18]. This paper extends
[19], in which we proposed an actor-critic method for
maximal reachability probability (MRP) problems, i.e., maximizing the
probability of reaching a set of states, to a computational
framework that finds a control policy such that the probability
of its paths satisfying an arbitrary LTL formula is locally
optimal over a set of parameters. This set of parameters is
tailored to this class of approximate dynamic
programming problems.

Our proposed algorithm produces a randomized policy, 
which gives a probability distribution over enabled actions 
at a state. Our method requires transition probabilities to
be generated only along sample paths, and is therefore 




Fig. 1. Robotic InDoor Environment (RIDE) platform. Left: An iCreate
mobile platform moving autonomously through the corridors and
intersections of an indoor-like environment. Right: The partial schematics of the
environment. The black blocks represent walls, and the grey and white
regions are intersections and corridors, respectively. The labels inside a region
represent observations associated with regions, such as Un (unsafe regions)
and Ri (risky regions).

particularly suitable for robotic applications. To the best of
our knowledge, this is the first work combining temporal logic
formal synthesis with actor-critic type methods. We illustrate
the algorithms with hardware-in-the-loop simulations using 
an accurate simulator of our Robotic InDoor Environment 
(RIDE) platform [20]. 

Notation: We use bold letters to denote sequences and
vectors. Vectors are assumed to be column vectors. The transpose
of a vector $x$ is denoted by $x^T$. $\|\cdot\|$ stands for the Euclidean
norm. $|S|$ denotes the cardinality of a set $S$.

II. Problem Formulation and Approach 

We consider a robot moving in an environment partitioned 
into regions such as the Robotic InDoor Environment (RIDE)
(see Fig. 1). Each region in the environment is associated
with a set of observations. Observations can be Un for unsafe 
regions, or Up for a region where the robot can upload 
data. We assume that the robot can detect its current region. 
Moreover, the robot is programmed with a set of motion 
primitives allowing it to move from a region to an adjacent 
region. To capture noise in actuation and sensing, we make 
the natural assumption that, at a given region, a motion 
primitive designed to take the robot to a specific adjacent 
region may take the robot to a different adjacent region. 

Such a robot model naturally leads to a labeled Markov 
Decision Process (MDP), which is defined below. 
Definition II.1 (Labeled Markov Decision Process). A labeled
Markov decision process (MDP) is a tuple
$\mathcal{M} = (Q, q_0, U, A, P, \Pi, h)$, where

(i) $Q = \{1, \ldots, n\}$ is a finite set of states;

(ii) $q_0 \in Q$ is the initial state;

(iii) $U$ is a finite set of actions;

(iv) $A : Q \to 2^U$ maps a state $q \in Q$ to actions enabled at $q$;

(v) $P : Q \times U \times Q \to [0,1]$ is the transition probability
function such that for all $q \in Q$, $\sum_{q' \in Q} P(q,u,q') = 1$
if $u \in A(q)$, and $P(q,u,q') = 0$ for all $q' \in Q$ if
$u \notin A(q)$;

(vi) $\Pi$ is a set of observations;

(vii) $h : Q \to 2^{\Pi}$ is the observation map.
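As an illustration only, the tuple in Definition II.1 can be held in a small Python container. The class name `LabeledMDP`, its field names, and the two-region toy numbers below are assumptions made for this sketch and do not come from the RIDE model.

```python
from dataclasses import dataclass


@dataclass
class LabeledMDP:
    states: set     # Q
    init: object    # q0
    actions: set    # U
    enabled: dict   # A: state -> set of enabled actions
    P: dict         # P: (q, u) -> {q': probability}
    labels: dict    # h: state -> frozenset of observations

    def check(self, tol=1e-9):
        """Return True if P satisfies condition (v) of the definition."""
        for q in self.states:
            for u in self.actions:
                total = sum(self.P.get((q, u), {}).values())
                expected = 1.0 if u in self.enabled[q] else 0.0
                if abs(total - expected) > tol:
                    return False
        return True

    def to_nts(self):
        """Forget the probabilities: (q, u) -> set of possible successors."""
        return {qu: {q2 for q2, p in dist.items() if p > 0}
                for qu, dist in self.P.items()}


# Toy two-region example (hypothetical region names and probabilities).
mdp = LabeledMDP(
    states={"C1", "I1"},
    init="C1",
    actions={"FollowRoad", "GoLeft"},
    enabled={"C1": {"FollowRoad"}, "I1": {"GoLeft"}},
    P={("C1", "FollowRoad"): {"I1": 0.9, "C1": 0.1},
       ("I1", "GoLeft"): {"C1": 1.0}},
    labels={"C1": frozenset(), "I1": frozenset({"Un"})},
)
```

The `to_nts` helper keeps only which transitions are possible, which is the information the rest of the construction relies on when probabilities are unavailable.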

Each state of the MDP $\mathcal{M}$ modeling the robot in the
environment corresponds to an ordered set of regions in the
environment, while the actions label the motion primitives
that can be applied at a region. For example, a state of
$\mathcal{M}$ may be labeled as $I_1$-$C_1$, which means that the robot
is currently at region $C_1$, coming from region $I_1$. Each
ordered set of regions corresponds to a recent history of
the robot trajectory, and is needed to ensure the Markov
property (more details on such an MDP abstraction of the robot
in the environment can be found in, e.g., [14]). The transition
probability function $P$ can be obtained through extensive
simulations of the robot in the environment. We assume
that there exists an accurate simulator that is capable of
generating (computing) the transition probability $P(q,u,\cdot)$
for each state-action pair $q \in Q$ and $u \in A(q)$. More details
of the construction of the MDP model for a robot in the
RIDE platform are included in Sec. IV.

If the exact transition probabilities are not known, $\mathcal{M}$
can be seen as a labeled non-deterministic transition system
(NTS) $\mathcal{M}^N = (Q, q_0, U, A, P^N, \Pi, h)$, where $P$ in $\mathcal{M}$ is
replaced by $P^N : Q \times U \times Q \to \{0,1\}$, and $P^N(q,u,q') = 1$
indicates a possible transition from $q$ to $q'$ applying an
enabled action $u \in A(q)$; if $P^N(q,u,q') = 0$, then the
transition from $q$ to $q'$ is not possible under $u$.

A path on $\mathcal{M}$ is a sequence of states $\mathbf{q} = q_0 q_1 \ldots$ such
that for all $k \ge 0$, there exists $u_k \in A(q_k)$ such that
$P(q_k, u_k, q_{k+1}) > 0$. Along a path $\mathbf{q} = q_0 q_1 \ldots$, $q_k$ is
said to be the state at time $k$. The trajectory of the robot
in the environment is represented by a path $\mathbf{q}$ on $\mathcal{M}$ (which
corresponds to a sequence of regions in the environment).
A path $\mathbf{q} = q_0 q_1 \ldots$ generates a sequence of observations
$h(\mathbf{q}) := o_0 o_1 \ldots$, where $o_k = h(q_k)$ for all $k \ge 0$. We call
$\mathbf{o} = h(\mathbf{q})$ the word generated by $\mathbf{q}$.

Definition II.2 (Policy). A control policy for an MDP $\mathcal{M}$ is
an infinite sequence $M = \mu_0 \mu_1 \ldots$, where $\mu_k : Q \times U \to [0,1]$
is such that $\sum_{u \in A(q)} \mu_k(q,u) = 1$, for all $k \ge 0$.

Namely, at time $k$, $\mu_k(q, \cdot)$ is a discrete probability
distribution over $A(q)$. If $\mu = \mu_k$ for all $k \ge 0$, then $M = \mu\mu\ldots$
is called a stationary policy. If for all $k \ge 0$, $\mu_k(q,u) = 1$
for some $u$, then $M$ is deterministic; otherwise, $M$ is
randomized. Given a policy $M$, we can then generate a set
of paths on $\mathcal{M}$, by applying $u_k$ with probability $\mu_k(q_k,u_k)$
at state $q_k$ for all time $k$.
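A minimal sketch of how a stationary randomized policy generates a path and its word is given below. The function names `sample_path` and `word`, the dictionary layout, and the two-state toy model are assumptions chosen for illustration.

```python
import random


def sample_path(P, policy, q0, steps, rng):
    """Sample q_0 q_1 ... q_steps under a stationary randomized policy.

    P:      (q, u) -> {q': probability}
    policy: q -> {u: probability}, i.e. mu(q, .) as a distribution over A(q)
    """
    path = [q0]
    q = q0
    for _ in range(steps):
        acts, probs = zip(*policy[q].items())
        u = rng.choices(acts, weights=probs)[0]      # u_k drawn from mu(q_k, .)
        succs, tprobs = zip(*P[(q, u)].items())
        q = rng.choices(succs, weights=tprobs)[0]    # q_{k+1} drawn from P(q_k, u_k, .)
        path.append(q)
    return path


def word(path, h):
    """Observation sequence o_k = h(q_k) generated by a path."""
    return [h[q] for q in path]


# Deterministic two-state toy example (hypothetical), so the sampled path is fixed.
P = {("a", "go"): {"b": 1.0}, ("b", "go"): {"a": 1.0}}
policy = {"a": {"go": 1.0}, "b": {"go": 1.0}}
h = {"a": frozenset(), "b": frozenset({"Up"})}
path = sample_path(P, policy, "a", 4, random.Random(0))
```

With a deterministic policy every distribution puts weight 1 on a single action, so the same code covers both policy types in the definition.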

We require the trajectory of the robot in the environment
to satisfy a rich task specification given as a Linear Temporal
Logic (LTL) (see, e.g., [7], [8]) formula over a set of
observations $\Pi$. An LTL formula over $\Pi$ is evaluated over
an (infinite) sequence $\mathbf{o} = o_0 o_1 \ldots$ (e.g., a word generated
by a path on $\mathcal{M}$), where $o_k \subseteq \Pi$ for all $k \ge 0$. We denote
$\mathbf{o} \models \phi$ if word $\mathbf{o}$ satisfies the LTL formula $\phi$, and we say $\mathbf{q}$
satisfies $\phi$ if $h(\mathbf{q}) \models \phi$. Roughly, $\phi$ can be constructed from
a set of observations $\Pi$, Boolean operators $\neg$ (negation),
$\vee$ (disjunction), $\wedge$ (conjunction), $\to$ (implication), and
temporal operators $\mathsf{X}$ (next), $\mathsf{U}$ (until), $\mathsf{F}$ (eventually), $\mathsf{G}$
(always). A variety of robotic tasks can be easily translated
to LTL formulas. For example, the following complex task
command in natural language: "Gather data at locations Da
infinitely often. Only reach a risky region Ri if valuable data
at VD can be gathered, and always avoid unsafe regions
(Un)" can be translated to the LTL formula:

$$\phi := \mathsf{G}\,\mathsf{F}\,\mathtt{Da} \wedge \mathsf{G}\,(\mathtt{Ri} \to \mathtt{VD}) \wedge \mathsf{G}\,\neg\mathtt{Un}.$$



In this paper, we consider the following problem. 

Problem II.3. Given a labeled MDP $\mathcal{M} = (Q, q_0, U, A, P, \Pi, h)$
modeling the motion of a robot
in a partitioned environment and a mission task specified
as an LTL formula $\phi$ over $\Pi$, find a control policy that
maximizes the probability of its path satisfying $\phi$.

The probability that paths generated under a policy $M$
satisfy an LTL formula $\phi$ is well defined with a suitable
measure over the set of all paths generated by $M$ [8].

In [13], we proposed a computational framework to solve
Prob. II.3 by adapting methods from the area of probabilistic
model checking [8]-[10]. However, this framework relies
upon the fact that the transition probabilities are known
for all state-action pairs. These transition probabilities are
typically not available for robotic applications and are
computationally expensive to compute. Moreover, even if the
transition probabilities are obtained for each state-action
pair, this method still requires solving a linear program on
the product of the MDP and the automaton representing the
formula, which can be very large (thousands or even millions
of states). In this case an approximate method might be
more desirable. For these reasons, we instead focus on the
following problem.

Problem II.4. Given a labeled NTS $\mathcal{M}^N = (Q, q_0, U, A, P^N, \Pi, h)$
modeling a robot in a partitioned
environment, a mission task specified as an LTL formula $\phi$
over $\Pi$, and an accurate simulator to compute transition
probabilities $P(q,u,\cdot)$ given a state-action pair $(q,u)$,
find a control policy that approximately maximizes the
probability of its path satisfying $\phi$.

In many robotic applications, the NTS model $\mathcal{M}^N = (Q, q_0, U, A, P^N, \Pi, h)$
can be quickly constructed for the
robot in the environment. Our approach to Prob. II.4 can
be summarized as follows: First, we translate
the problem to a maximal reachability probability (MRP)
problem using $\mathcal{M}^N$ and $\phi$ (Sec. III-A). We then use an
actor-critic framework to find a randomized policy giving
an approximate solution to the MRP problem (Sec. III-B).
The randomized policy is constructed to be a function of a
small set of parameters, and we find a policy that is locally
optimal with respect to these parameters. The construction of
a class of policies suitable for MRP problems without using
the transition probabilities is explained in Sec. III-C. The
algorithmic framework presented in this paper is summarized
in Sec. III-D.



III. Control Synthesis 

A. Formulation of the MRP Problem 

The formulation of the MRP problem is based on [8]-[10],
[13], with modifications where needed when using the NTS $\mathcal{M}^N$
instead of $\mathcal{M}$. We start by converting the LTL formula $\phi$
over $\Pi$ to a so-called deterministic Rabin automaton, which
is defined as follows.

Definition III.1 (Deterministic Rabin Automaton). A
deterministic Rabin automaton (DRA) is a tuple
$\mathcal{R} = (S, s_0, \Sigma, \delta, F)$, where

(i) $S$ is a finite set of states;

(ii) $s_0 \in S$ is the initial state;

(iii) $\Sigma$ is a set of inputs (alphabet);

(iv) $\delta : S \times \Sigma \to S$ is the transition function;

(v) $F = \{(L(1),K(1)), \ldots, (L(M),K(M))\}$ is a set of
pairs of sets of states such that $L(i), K(i) \subseteq S$ for all
$i = 1, \ldots, M$.

A run of a Rabin automaton $\mathcal{R}$, denoted by $\mathbf{r} = s_0 s_1 \ldots$, is
an infinite sequence of states in $\mathcal{R}$ such that for each $k \ge 0$,
$s_{k+1} = \delta(s_k, \alpha)$ for some $\alpha \in \Sigma$. A run $\mathbf{r}$ is accepting if
there exists a pair $(L, K) \in F$ such that $\mathbf{r}$ intersects with
$L$ finitely many times and $K$ infinitely many times. For any
LTL formula $\phi$ over $\Pi$, one can construct a DRA (which
we denote by $\mathcal{R}_\phi$) with input alphabet $\Sigma = 2^{\Pi}$ accepting all
and only words over $\Pi$ that satisfy $\phi$ (see [21]).
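To make the Rabin acceptance condition concrete, the sketch below checks it for an ultimately periodic word (a finite prefix followed by a non-empty cycle repeated forever). The helper names and the two-state automaton for "infinitely often Da" are assumptions for illustration; translating a general LTL formula to a DRA needs a dedicated tool and is not shown.

```python
def run_dra(delta, s0, letters):
    """States visited by the DRA when reading a finite sequence of letters."""
    run = [s0]
    for letter in letters:
        run.append(delta[(run[-1], letter)])
    return run


def accepts_lasso(delta, s0, pairs, prefix, cycle):
    """Rabin acceptance for the infinite word prefix . cycle . cycle . ...

    pairs is a list of (L, K) sets of automaton states.
    """
    s = run_dra(delta, s0, prefix)[-1]
    seen = {}                 # state at a cycle boundary -> round index
    rounds = []               # states visited during each reading of the cycle
    while s not in seen:
        seen[s] = len(rounds)
        r = run_dra(delta, s, cycle)
        rounds.append(set(r[1:]))
        s = r[-1]
    # Rounds from the first repeated boundary state onward recur forever.
    inf = set().union(*rounds[seen[s]:])
    return any(not (inf & L) and bool(inf & K) for L, K in pairs)


# Hypothetical DRA for "G F Da": state 1 is entered exactly when Da is observed.
D, E = frozenset({"Da"}), frozenset()
delta = {(0, D): 1, (1, D): 1, (0, E): 0, (1, E): 0}
pairs = [(set(), {1})]
often = accepts_lasso(delta, 0, pairs, [E], [D, E])    # Da recurs forever
once = accepts_lasso(delta, 0, pairs, [D], [E])        # Da seen only once
```

Because the automaton is deterministic, the state at the start of each cycle repetition must eventually repeat, which is what lets the set of infinitely visited states be computed in finitely many steps.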

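We then obtain an MDP as the product of a labeled
MDP $\mathcal{M}$ and a DRA $\mathcal{R}_\phi$, which captures all paths of
$\mathcal{M}$ satisfying $\phi$. Note that this product MDP can only be
constructed from an MDP and a deterministic automaton;
this is why we require a DRA instead of, e.g., a (generally
non-deterministic) Büchi automaton (see [8]).

Definition III.2 (Product MDP). The product MDP $\mathcal{M} \times \mathcal{R}_\phi$
between a labeled MDP $\mathcal{M} = (Q, q_0, U, A, P, \Pi, h)$
and a DRA $\mathcal{R}_\phi = (S, s_0, 2^{\Pi}, \delta, F)$ is an MDP
$\mathcal{P} = (S_{\mathcal{P}}, s_{\mathcal{P}0}, U_{\mathcal{P}}, A_{\mathcal{P}}, P_{\mathcal{P}}, \Pi, h_{\mathcal{P}})$, where

(i) $S_{\mathcal{P}} = Q \times S$ is a set of states;

(ii) $s_{\mathcal{P}0} = (q_0, s_0)$ is the initial state;

(iii) $U_{\mathcal{P}} = U$ is a set of actions inherited from $\mathcal{M}$;

(iv) $A_{\mathcal{P}}$ is also inherited from $\mathcal{M}$ and $A_{\mathcal{P}}((q,s)) := A(q)$;

(v) $P_{\mathcal{P}}$ gives the transition probabilities:

$$P_{\mathcal{P}}((q,s),u,(q',s')) = \begin{cases} P(q,u,q') & \text{if } s' = \delta(s, h(q)), \\ 0 & \text{otherwise.} \end{cases}$$

Note that $h_{\mathcal{P}}$ is not used in the product MDP. Moreover, $\mathcal{P}$
is associated with pairs of accepting states (similar to a DRA)
$F_{\mathcal{P}} := \{(L_{\mathcal{P}}(1),K_{\mathcal{P}}(1)), \ldots, (L_{\mathcal{P}}(M),K_{\mathcal{P}}(M))\}$, where
$L_{\mathcal{P}}(i) = Q \times L(i)$ and $K_{\mathcal{P}}(i) = Q \times K(i)$, for $i = 1, \ldots, M$.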

The product MDP is constructed in such a way that, given
a path $(q_0, s_0)(q_1, s_1) \ldots$ on $\mathcal{P}$, the corresponding path $q_0 q_1 \ldots$ on
$\mathcal{M}$ satisfies $\phi$ if and only if there exists a pair $(L_{\mathcal{P}}, K_{\mathcal{P}}) \in F_{\mathcal{P}}$
satisfying the Rabin acceptance condition, i.e., the set $K_{\mathcal{P}}$
is visited infinitely often and the set $L_{\mathcal{P}}$ is visited finitely
often.

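We can make a very similar product between a labeled
NTS $\mathcal{M}^N = (Q, q_0, U, A, P^N, \Pi, h)$ and $\mathcal{R}_\phi$. This
product is also an NTS, which we denote by
$\mathcal{P}^N = (S_{\mathcal{P}}, s_{\mathcal{P}0}, U_{\mathcal{P}}, A_{\mathcal{P}}, P^N_{\mathcal{P}}, \Pi, h_{\mathcal{P}}) := \mathcal{M}^N \times \mathcal{R}_\phi$, associated
with accepting sets $F_{\mathcal{P}}$. The definition (and the accepting
condition) of $\mathcal{P}^N$ is exactly the same as for the product
MDP. The only difference between $\mathcal{P}^N$ and $\mathcal{P}$ is in $P^N_{\mathcal{P}}$,
which is either 0 or 1 for every state-action-state tuple.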
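The product construction can be sketched as follows for the NTS case. This version builds only the part of the product reachable from the initial state, which is a simplification of the full state set $Q \times S$; the function names, dictionary layout, and the toy NTS and automaton are assumptions for illustration.

```python
def product_nts(succ, enabled, h, delta, q0, s0):
    """Reachable part of the product of an NTS and a DRA.

    succ:    (q, u) -> set of possible successors q'  (where P^N = 1)
    enabled: q -> set of enabled actions              (A)
    h:       q -> frozenset of observations
    delta:   (s, letter) -> s'                        (DRA transition function)
    Returns (states, psucc) with psucc[((q, s), u)] = set of successors (q', s').
    """
    init = (q0, s0)
    states, stack, psucc = {init}, [init], {}
    while stack:
        q, s = stack.pop()
        s_next = delta[(s, h[q])]            # s' = delta(s, h(q))
        for u in enabled[q]:
            targets = {(q2, s_next) for q2 in succ[(q, u)]}
            psucc[((q, s), u)] = targets
            for t in targets - states:
                states.add(t)
                stack.append(t)
    return states, psucc


def lift_pairs(pairs, Q):
    """Accepting pairs of the product: L_P(i) = Q x L(i), K_P(i) = Q x K(i)."""
    return [({(q, s) for q in Q for s in L}, {(q, s) for q in Q for s in K})
            for L, K in pairs]


# Hypothetical two-state NTS and a two-state DRA over the observation "Da".
D, E = frozenset({"Da"}), frozenset()
succ = {("x", "go"): {"y"}, ("y", "go"): {"x", "y"}}
enabled = {"x": {"go"}, "y": {"go"}}
h = {"x": E, "y": D}
delta = {(0, D): 1, (1, D): 1, (0, E): 0, (1, E): 0}
states, psucc = product_nts(succ, enabled, h, delta, "x", 0)
F_P = lift_pairs([(set(), {1})], {"x", "y"})
```

Since the automaton component of every successor is fixed by the current state's label, the product has no more nondeterminism than the original NTS.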

From the product $\mathcal{P}$, or equivalently $\mathcal{P}^N$, we can proceed
to construct the MRP problem. To do so, it is necessary to
produce the so-called accepting maximal end components
(AMECs). An end component is a subset of an MDP
(consisting of a subset of states and a subset of enabled
actions at each state) such that for each pair of states $(i, j)$
in the component, there is a sequence of actions such that $i$ can be
reached from $j$ with positive probability, and states outside
the component cannot be reached. An AMEC of $\mathcal{P}$ is the
largest end component containing at least one state in $K_{\mathcal{P}}$
and no state in $L_{\mathcal{P}}$, for a pair $(L_{\mathcal{P}}, K_{\mathcal{P}}) \in F_{\mathcal{P}}$.

A procedure to obtain all AMECs of an MDP is outlined
in [8]. This procedure is intended to be used for the product
MDP $\mathcal{P}$, but it can be used without modification to find
all AMECs associated with $\mathcal{P}$ when $\mathcal{P}^N$ is used instead of
$\mathcal{P}$. This is because the information needed to construct the
AMECs is the set of all possible state transitions at each
state, and this information is already contained in $\mathcal{P}^N$.

If we denote $S^*_{\mathcal{P}}$ as the union of all states in all AMECs
associated with $\mathcal{P}$, it has been shown in probabilistic model
checking (see, e.g., [8]) that the maximal probability of satisfying the
LTL formula is given by the maximal probability of reaching
the set $S^*_{\mathcal{P}}$ from the initial state $s_{\mathcal{P}0}$. The desired optimal
policy can then be obtained as the policy maximizing this
probability. If transition probabilities are available for each
state-action pair, then the solution to this MRP problem can
be obtained by solving a linear program (see [8], [22]). The resultant
optimal policy is deterministic and stationary (i.e., $M = \mu\mu\ldots$) on
the product MDP $\mathcal{P}$. To implement this policy on $\mathcal{M}$, it is
necessary to use the DRA as a feedback automaton to keep
track of the current state $s_{\mathcal{P}}$ on $\mathcal{P}$, and apply the action $u$
where $\mu(s_{\mathcal{P}}, u) = 1$ (since $\mu$ is deterministic).
Remark III.3. It is only necessary to find the optimal
policy for states not in the set $S^*_{\mathcal{P}}$. This is because by
construction, there exists a policy inside any AMEC that
almost surely satisfies the LTL formula $\phi$ by reaching a state
in $K_{\mathcal{P}}$ infinitely often. This policy can be obtained by simply
choosing an action (among the subset of actions retained by
the AMEC) at each state randomly, i.e., a trivial randomized
stationary policy exists that almost surely satisfies $\phi$.

B. LSTD Actor-Critic Method 

We now describe how relevant results in [19] can be
applied to solve Prob. II.4. An approximate dynamic
programming algorithm of the actor-critic type was presented
in [19], which obtains a stationary randomized policy (RSP)
(see Def. II.2) $M = \mu_\theta \mu_\theta \ldots$, where $\mu_\theta(q,u)$ is a function of
the state-action pair $(q,u)$ and $\theta \in \mathbb{R}^n$, which is a vector of
parameters. For convenience of notation, we denote an
RSP $\mu_\theta \mu_\theta \ldots$ simply by $\mu_\theta$. In this subsection we assume
the RSP $\mu_\theta(q,u)$ to be given, and we describe in
Sec. III-C how to design a suitable RSP.

Given an RSP $\mu_\theta$, actor-critic algorithms can be applied
to optimize the parameter vector $\theta$ by policy gradient
estimation. The basic idea is to use stochastic learning techniques
to find $\theta$ that locally optimizes a cost function. In particular,
the algorithm presented in [19] is targeted at Stochastic
Shortest Path (SSP) problems commonly studied in the literature
(see, e.g., [22]). Given an MDP $\mathcal{M} = (Q, q_0, U, A, P, \Pi, h)$,
a termination state $q^* \in Q$ and a function $g(q,u)$ defining
the one-step cost of applying action $u$ at state $q$, the expected
total cost is defined as:

$$\alpha(\theta) := \lim_{N \to \infty} E\left[ \sum_{k=0}^{N-1} g(q_k, u_k) \right], \tag{1}$$

where $(q_k, u_k)$ is the state-action pair at time $k$ along a path
under RSP $\mu_\theta$.

The SSP problem is formulated as the problem of finding
$\theta^*$ minimizing (1). Note that, in general, we assume $q^*$ to be
cost-free and absorbing, i.e., $g(q^*, u) = 0$ and $P(q^*, u, q^*) = 1$
for all $u \in A(q^*)$. Under these conditions, the expected
total cost (1) is finite.



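We note that an MRP problem as described in Sec. III-A
can be immediately converted to an SSP problem.

Definition III.4 (Conversion from MRP to SSP). Given the
product MDP $\mathcal{P} = (S_{\mathcal{P}}, s_{\mathcal{P}0}, U_{\mathcal{P}}, A_{\mathcal{P}}, P_{\mathcal{P}}, F_{\mathcal{P}})$ and a set of
states $S^*_{\mathcal{P}} \subseteq S_{\mathcal{P}}$, the problem of maximizing the probability
of reaching $S^*_{\mathcal{P}}$ can be converted to an SSP problem by
defining a new MDP $\tilde{\mathcal{P}} = (\tilde{S}_{\mathcal{P}}, \tilde{s}_{\mathcal{P}0}, \tilde{U}_{\mathcal{P}}, \tilde{A}_{\mathcal{P}}, \tilde{P}_{\mathcal{P}}, \tilde{g}_{\mathcal{P}})$, where

(i) $\tilde{S}_{\mathcal{P}} = (S_{\mathcal{P}} \setminus S^*_{\mathcal{P}}) \cup \{s^*_{\mathcal{P}}\}$, where $s^*_{\mathcal{P}}$ is a "dummy"
terminal state;

(ii) $\tilde{s}_{\mathcal{P}0} = s_{\mathcal{P}0}$ (without loss of generality, we exclude
the trivial case where $s_{\mathcal{P}0} \in S^*_{\mathcal{P}}$);

(iii) $\tilde{U}_{\mathcal{P}} = U_{\mathcal{P}}$;

(iv) $\tilde{A}_{\mathcal{P}}(s_{\mathcal{P}}) = A_{\mathcal{P}}(s_{\mathcal{P}})$ for all $s_{\mathcal{P}} \in S_{\mathcal{P}}$, and for the dummy
state we set $\tilde{A}_{\mathcal{P}}(s^*_{\mathcal{P}}) = \tilde{U}_{\mathcal{P}}$;

(v) The transition probability is redefined as follows. We
first define $\bar{S}^*_{\mathcal{P}}$ as the set of states on $\mathcal{P}$ that cannot
reach $S^*_{\mathcal{P}}$ under any policy. We then define:

$$\tilde{P}_{\mathcal{P}}(s_{\mathcal{P}}, u, s'_{\mathcal{P}}) = \begin{cases} \sum_{s''_{\mathcal{P}} \in S^*_{\mathcal{P}}} P_{\mathcal{P}}(s_{\mathcal{P}}, u, s''_{\mathcal{P}}), & \text{if } s'_{\mathcal{P}} = s^*_{\mathcal{P}}, \\ P_{\mathcal{P}}(s_{\mathcal{P}}, u, s'_{\mathcal{P}}), & \text{if } s'_{\mathcal{P}} \in S_{\mathcal{P}} \setminus S^*_{\mathcal{P}}, \end{cases}$$

for all $s_{\mathcal{P}} \in S_{\mathcal{P}} \setminus (S^*_{\mathcal{P}} \cup \bar{S}^*_{\mathcal{P}})$ and $u \in \tilde{U}_{\mathcal{P}}$. Moreover, for
all $s_{\mathcal{P}} \in \bar{S}^*_{\mathcal{P}}$ and $u \in \tilde{U}_{\mathcal{P}}$, we set $\tilde{P}_{\mathcal{P}}(s^*_{\mathcal{P}}, u, s^*_{\mathcal{P}}) = 1$
and $\tilde{P}_{\mathcal{P}}(s_{\mathcal{P}}, u, s_{\mathcal{P}0}) = 1$;

(vi) For all $s_{\mathcal{P}} \in \tilde{S}_{\mathcal{P}}$ and $u \in \tilde{U}_{\mathcal{P}}$, we define the one-step
cost $\tilde{g}_{\mathcal{P}}(s_{\mathcal{P}}, u) = 1$ if $s_{\mathcal{P}} \in \bar{S}^*_{\mathcal{P}}$, and $\tilde{g}_{\mathcal{P}}(s_{\mathcal{P}}, u) = 0$
otherwise.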

We have shown in [19] that the policy minimizing (1) for
the SSP problem with MDP $\tilde{\mathcal{P}}$ and the termination state $s^*_{\mathcal{P}}$
is a policy maximizing the probability of reaching the set
$S^*_{\mathcal{P}}$ on $\mathcal{P}$, i.e., a solution to the MRP problem formulated in
Sec. III-A.



Fig. 2. Diagram illustrating an actor-critic algorithm (blocks: policy (actor
parameter) update; critic parameter update).



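The SSP problem can also be constructed from
the NTS $\mathcal{P}^N$. In this case we obtain an NTS
$\tilde{\mathcal{P}}^N = (\tilde{S}_{\mathcal{P}}, \tilde{s}_{\mathcal{P}0}, \tilde{U}_{\mathcal{P}}, \tilde{A}_{\mathcal{P}}, \tilde{P}^N_{\mathcal{P}}, \tilde{g}_{\mathcal{P}})$, using the exact same
construction as Def. III.4 except for the definition of $\tilde{P}^N_{\mathcal{P}}$.
The transition function $\tilde{P}^N_{\mathcal{P}}(s_{\mathcal{P}}, u, s'_{\mathcal{P}})$ is instead defined as:

$$\tilde{P}^N_{\mathcal{P}}(s_{\mathcal{P}}, u, s'_{\mathcal{P}}) = \begin{cases} \max_{s''_{\mathcal{P}} \in S^*_{\mathcal{P}}} P^N_{\mathcal{P}}(s_{\mathcal{P}}, u, s''_{\mathcal{P}}), & \text{if } s'_{\mathcal{P}} = s^*_{\mathcal{P}}, \\ P^N_{\mathcal{P}}(s_{\mathcal{P}}, u, s'_{\mathcal{P}}), & \text{if } s'_{\mathcal{P}} \in S_{\mathcal{P}} \setminus S^*_{\mathcal{P}}, \end{cases}$$

for all $s_{\mathcal{P}} \in S_{\mathcal{P}} \setminus (S^*_{\mathcal{P}} \cup \bar{S}^*_{\mathcal{P}})$ and $u \in \tilde{U}_{\mathcal{P}}$. Moreover, for
all $s_{\mathcal{P}} \in \bar{S}^*_{\mathcal{P}}$ and $u \in \tilde{U}_{\mathcal{P}}$, we set $\tilde{P}^N_{\mathcal{P}}(s^*_{\mathcal{P}}, u, s^*_{\mathcal{P}}) = 1$ and
$\tilde{P}^N_{\mathcal{P}}(s_{\mathcal{P}}, u, s_{\mathcal{P}0}) = 1$.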
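The NTS version of the conversion can be sketched in a few lines. The code below first finds the states that cannot reach the goal set by a backward fixed point, then redirects them to the initial state at unit cost and merges the goal set into one terminal state. The function names, the `"s*"` marker for the dummy state, and the four-state example are assumptions; the absorbing self-loop of the terminal state is left implicit.

```python
def cannot_reach(states, psucc, goal):
    """States from which the goal set is unreachable under any policy."""
    can = set(goal)
    changed = True
    while changed:                          # backward fixed point
        changed = False
        for (sp, u), targets in psucc.items():
            if sp not in can and targets & can:
                can.add(sp)
                changed = True
    return set(states) - can


def to_ssp(states, psucc, init, goal, terminal="s*"):
    """SSP version of the product NTS: returns (new_succ, cost, bad)."""
    bad = cannot_reach(states, psucc, goal)
    new_succ, cost = {}, {}
    for (sp, u), targets in psucc.items():
        if sp in goal:
            continue                        # goal states are merged into the terminal state
        if sp in bad:
            new_succ[(sp, u)] = {init}      # restart from the initial state ...
            cost[(sp, u)] = 1               # ... at unit cost
        else:
            kept = {t for t in targets if t not in goal}
            if targets & goal:
                kept.add(terminal)          # any possible move into the goal set leads to s*
            new_succ[(sp, u)] = kept
            cost[(sp, u)] = 0
    return new_succ, cost, bad


# Hypothetical product NTS: state 3 is the goal, state 2 is a trap.
states = {0, 1, 2, 3}
psucc = {(0, "a"): {1, 2}, (1, "a"): {3}, (2, "a"): {2}, (3, "a"): {3}}
new_succ, cost, bad = to_ssp(states, psucc, init=0, goal={3})
```

Only the possibility of each transition is used here, which matches the point that the SSP structure can be built without transition probabilities.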

Once the SSP problem is constructed, the algorithm
presented in [19] is an iterative procedure that obtains a policy
that locally minimizes the cost function (1) by simulating
sample paths on $\tilde{\mathcal{P}}$. Each sample path on $\tilde{\mathcal{P}}$ starts at $s_{\mathcal{P}0}$
and ends when the termination state $s^*_{\mathcal{P}}$ is reached. Since
the transition probabilities are needed only along the sample path, we
do not require the MDP $\tilde{\mathcal{P}}$, but only $\tilde{\mathcal{P}}^N$.

An actor-critic algorithm operates in the following way:
the critic observes the state and one-step cost from the MDP and
uses the observed information to update the critic parameters,
then the critic parameters are used to update the policy; the
actor generates the action based on the policy and applies the
action to the MDP. The algorithm stops when the gradient of
$\alpha(\theta)$ is small enough (i.e., $\theta$ is locally optimal). The
actor-critic update mechanism is shown in Fig. 2.

We summarize the actor-critic update algorithm in Alg. 1,
and we note that it does not depend on the form of the RSP $\mu_\theta$.
The vectors $z_k \in \mathbb{R}^n$, $b_k \in \mathbb{R}^n$, $r_k \in \mathbb{R}^n$ and the matrix
$A_k \in \mathbb{R}^{n \times n}$ are updated during each critic update, while
simultaneously, the vector $\theta_k \in \mathbb{R}^n$ is updated during each
actor update. Both the critic and actor updates depend on

$$\psi_\theta(x,u) := \nabla_\theta \ln \mu_\theta(x,u), \tag{2}$$

which is the gradient of the logarithm of $\mu_\theta(x,u)$, to estimate
the gradient $\nabla\alpha(\theta)$. Lastly, the sequence $\{\gamma_k\}$ controls the critic
step-size, while $\{\beta_k\}$ and $\Gamma(r_k)$ control the actor step-size.
We note that all step-size parameters are positive, and their
effect on the convergence rate is discussed in [19].

The critic update algorithm in Alg. 1 is of the LSTD
type, which has been shown to be superior to other approximate
dynamic programming methods in terms of the convergence
rate [18]. More details of this algorithm can be found in [19].

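Algorithm 1 LSTD actor-critic algorithm for SSP problems

Input: The NTS $\tilde{\mathcal{P}}^N = (\tilde{S}_{\mathcal{P}}, \tilde{s}_{\mathcal{P}0}, \tilde{U}_{\mathcal{P}}, \tilde{A}_{\mathcal{P}}, \tilde{P}^N_{\mathcal{P}}, \tilde{g}_{\mathcal{P}})$ with the
terminal state $s^*_{\mathcal{P}}$, the RSP $\mu_\theta$, and a computation tool to obtain
$\tilde{P}_{\mathcal{P}}(s_{\mathcal{P}}, u, \cdot)$ for a given $(s_{\mathcal{P}}, u)$ state-action pair.
1: Initialization: Set all entries in $z_0$, $A_0$, $b_0$ and $r_0$ to zeros. Let
$\theta_0$ take some initial value. Set initial state $x_0 := \tilde{s}_{\mathcal{P}0}$. Obtain
action $u_0$ using the RSP $\mu_{\theta_0}$.
2: repeat
3: Compute the transition probabilities $\tilde{P}_{\mathcal{P}}(x_k, u_k, \cdot)$.
4: Obtain the simulated subsequent state $x_{k+1}$ using the transition
probabilities $\tilde{P}_{\mathcal{P}}(x_k, u_k, \cdot)$. If $x_k = s^*_{\mathcal{P}}$, set $x_{k+1} := x_0$.
5: Obtain action $u_{k+1}$ using the RSP $\mu_{\theta_k}$.
6: Critic Update:

$$z_{k+1} = \lambda z_k + \psi_{\theta_k}(x_k, u_k),$$
$$b_{k+1} = b_k + \gamma_k \left( g(x_k, u_k)\, z_k - b_k \right),$$
$$A_{k+1} = A_k + \gamma_k \left( z_k \left( \psi^T_{\theta_k}(x_{k+1}, u_{k+1}) - \psi^T_{\theta_k}(x_k, u_k) \right) - A_k \right),$$
$$r_{k+1} = -A_k^{-1} b_k.$$

7: Actor Update:

$$\theta_{k+1} = \theta_k - \beta_k \Gamma(r_k)\, r_k^T \psi_{\theta_k}(x_{k+1}, u_{k+1})\, \psi_{\theta_k}(x_{k+1}, u_{k+1})$$

8: until $\|\nabla\alpha(\theta_k)\| \le \epsilon$ for some given $\epsilon$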
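One iteration of the critic and actor updates in Alg. 1 can be sketched in plain Python as below. This is an illustrative sketch, not the implementation of [19]: the function names, the list-based vectors, and the choice to keep $r$ unchanged while $A_k$ is singular (for example at $A_0 = 0$) are assumptions, and the step-size function `Gamma` is supplied by the caller because its form is not specified here.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting; None if singular."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            return None
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def critic_actor_step(z, b, A, r, theta, psi_k, psi_k1, g_k,
                      lam, gamma_k, beta_k, Gamma):
    """One iteration of steps 6 and 7 of Alg. 1 (vectors are lists of length n).

    psi_k  = psi_theta(x_k, u_k), psi_k1 = psi_theta(x_{k+1}, u_{k+1}),
    g_k    = g(x_k, u_k).
    """
    n = len(z)
    z1 = [lam * z[i] + psi_k[i] for i in range(n)]
    b1 = [b[i] + gamma_k * (g_k * z[i] - b[i]) for i in range(n)]
    A1 = [[A[i][j] + gamma_k * (z[i] * (psi_k1[j] - psi_k[j]) - A[i][j])
           for j in range(n)] for i in range(n)]
    sol = solve(A, b)                                     # A_k^{-1} b_k
    r1 = [-v for v in sol] if sol is not None else r[:]   # assumption: keep r if A_k is singular
    # Actor: theta_{k+1} = theta_k - beta_k Gamma(r_k) (r_k^T psi_{k+1}) psi_{k+1}
    scale = beta_k * Gamma(r) * sum(r[i] * psi_k1[i] for i in range(n))
    theta1 = [theta[i] - scale * psi_k1[i] for i in range(n)]
    return z1, b1, A1, r1, theta1


# Hypothetical numbers for a single step with n = 2.
out = critic_actor_step(
    z=[1.0, 0.0], b=[0.0, 0.0], A=[[1.0, 0.0], [0.0, 1.0]], r=[1.0, 1.0],
    theta=[0.0, 0.0], psi_k=[1.0, 0.0], psi_k1=[0.0, 1.0], g_k=1.0,
    lam=0.5, gamma_k=0.5, beta_k=0.1, Gamma=lambda r: 1.0)
```

Note that every right-hand side uses the quantities from iteration $k$, so the new values are computed first and returned together rather than updated in place.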



C. Designing an RSP 

In this section we describe a randomized policy that is suitable
for use in Alg. 1 for MRP problems and does not require
the transition probabilities. We propose a family of RSPs
that perform a "$t$-step look-ahead". This class of policies
considers all possible sequences of actions in $t$ steps and
obtains a probability for each action sequence.

To simplify notation, for a pair of states $i, j \in S_{\mathcal{P}}$, we
denote $i \xrightarrow{k} j$ if there is a positive probability of reaching
$j$ from $i$ in $k$ steps. This can be quickly verified given
$P^N_{\mathcal{P}}$ without transition probabilities. At state $i \in S_{\mathcal{P}}$, we
denote an action sequence from $i$ with $t$ steps look-ahead
as $e = u_1 u_2 \ldots u_t$, where $u_k \in A_{\mathcal{P}}(j)$ for some $j$ such
that $i \xrightarrow{k} j$, for all $k = 1, \ldots, t$. We denote the set of all
action sequences from state $i$ as $E(i)$. Given $e \in E(i)$, we
denote $P^N_{\mathcal{P}}(i,e,j) = 1$ if there is a positive probability of
reaching $j$ from $i$ with the action sequence $e$. This can also
be recursively obtained given $P^N_{\mathcal{P}}(i, u, \cdot)$.

For each pair of states $i, j \in S_{\mathcal{P}}$, we define $d(i,j)$ as the
minimum number of steps from $i$ to reach $j$ (this again can be
obtained quickly from $P^N_{\mathcal{P}}$ without transition probabilities).
We denote $j \in N(i)$ if and only if $d(i,j) \le r_N$, where $r_N$
is a fixed integer given a priori. If $j \in N(i)$, then we say $j$
is in the neighborhood of $i$, and $r_N$ represents the radius of
the neighborhood around each state.

For each state $i \in S_{\mathcal{P}}$, we define the safety score $\mathrm{safe}(i)$
as the ratio of the neighboring states not in $\bar{S}^*_{\mathcal{P}}$ over all
neighboring states of $i$. Recall that $\bar{S}^*_{\mathcal{P}}$ is the set of states
with zero probability of reaching the goal states $S^*_{\mathcal{P}}$. To be
more specific, we define:

$$\mathrm{safe}(i) := \frac{\sum_{j \in N(i)} I(j)}{|N(i)|}, \tag{3}$$

where $I(\cdot)$ is an indicator function such that $I(i) = 1$ if and
only if $i \in S_{\mathcal{P}} \setminus \bar{S}^*_{\mathcal{P}}$ and $I(i) = 0$ otherwise. A higher safety
score for the current state implies that it is less likely to reach
$\bar{S}^*_{\mathcal{P}}$ in the near future. Furthermore, we define the progress
score of a state $i \in S_{\mathcal{P}}$ as $\mathrm{progress}(i) := \min_{j \in S^*_{\mathcal{P}}} d(i,j)$,
which is the minimum number of transitions from $i$ to any
goal state.
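Both scores only need shortest step counts in the transition graph, so a breadth-first search suffices. In the sketch below, `succ1` (one-step successors under any action), the function names, and the four-state example are assumptions; the neighborhood is taken to include the state itself, since $d(i,i) = 0$.

```python
from collections import deque


def distances(succ1, src):
    """BFS: minimum number of steps d(src, j) for every reachable j.

    succ1: state -> set of states reachable in one step under some action.
    """
    d = {src: 0}
    queue = deque([src])
    while queue:
        i = queue.popleft()
        for j in succ1.get(i, ()):
            if j not in d:
                d[j] = d[i] + 1
                queue.append(j)
    return d


def safe(succ1, i, bad, r_N):
    """Fraction of states in the neighborhood N(i) that are not in the bad set."""
    N = [j for j, dist in distances(succ1, i).items() if dist <= r_N]
    return sum(1 for j in N if j not in bad) / len(N)


def progress(succ1, i, goal):
    """Minimum number of transitions from i to any goal state (inf if unreachable)."""
    d = distances(succ1, i)
    return min((d[j] for j in goal if j in d), default=float("inf"))


# Hypothetical graph: state 3 is the goal, state 2 cannot reach it.
succ1 = {0: {1, 2}, 1: {3}, 2: {2}, 3: {3}}
score = safe(succ1, 0, bad={2}, r_N=1)
steps = progress(succ1, 0, goal={3})
```

In this example the neighborhood of state 0 with radius 1 is {0, 1, 2}, of which one state is unsafe.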

We can now present the definition of our RSP. Let $\theta := [\theta_1, \theta_2]^T$.
We define:

$$a(\theta, i, e) = \exp\Big( \theta_1 \sum_{j \in N(i)} \mathrm{safe}(j)\, P^N_{\mathcal{P}}(i,e,j) + \theta_2 \sum_{j \in N(i)} \big( \mathrm{progress}(j) - \mathrm{progress}(i) \big)\, P^N_{\mathcal{P}}(i,e,j) \Big), \tag{4}$$

where exp is the exponential function. Note that $a(\theta, i, e)$ is
the combination of the expected safety score of the next state
applying the action sequence $e$, and the expected improved
progress score from the current state applying $e$, weighted
by $\theta_1$ and $\theta_2$. We assign the probability of picking the action
sequence $e$ at $i$ proportional to the combined score $a(\theta, i, e)$.
Hence, the probability to pick action sequence $e$ at state $i$ is
defined as:

$$\mu_\theta(i, e) = \frac{a(\theta, i, e)}{\sum_{e' \in E(i)} a(\theta, i, e')}. \tag{5}$$

Note that, if the action sequence $e = u_1 u_2 \ldots u_t$ is picked,
only the first action $u_1$ is applied. Hence, at state $i$, the
probability that an action $u \in A_{\mathcal{P}}(i)$ is applied can be derived from
Eq. (5):

$$\mu_\theta(i, u) = \sum_{\{e \in E(i) \,\mid\, e = u u_2 \ldots u_t\}} \mu_\theta(i, e), \tag{6}$$

which completes the definition of the RSP.
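The normalization and marginalization steps can be sketched as follows. The function name `rsp_first_action` and the input `seq_features`, which maps each action sequence to its two precomputed sums from the exponent of $a(\theta, i, e)$, are assumptions for illustration; the feature values in the example are made up.

```python
import math


def rsp_first_action(theta, seq_features):
    """Distribution over first actions at a state.

    seq_features: action sequence e (tuple of actions) -> (safety term, progress term),
    i.e. the two sums in the exponent of a(theta, i, e).
    """
    # Unnormalized score of each action sequence.
    a = {e: math.exp(theta[0] * s + theta[1] * p)
         for e, (s, p) in seq_features.items()}
    total = sum(a.values())
    mu = {}
    for e, val in a.items():
        # Add the sequence probability to the action that starts the sequence.
        mu[e[0]] = mu.get(e[0], 0.0) + val / total
    return mu


# Hypothetical two-step look-ahead at one state with three admissible sequences.
seq_features = {("L", "L"): (1.0, -1.0), ("L", "R"): (0.5, 0.0), ("R", "L"): (0.0, 1.0)}
uniform = rsp_first_action([0.0, 0.0], seq_features)
tilted = rsp_first_action([2.0, 0.0], seq_features)
```

With $\theta = 0$ every sequence is equally likely, so the first-action probabilities simply count sequences; a positive $\theta_1$ shifts mass toward sequences with a higher safety term.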

D. Overall Algorithm 

We now connect all the pieces together and present the
overall algorithm giving a solution to Prob. II.4.

Proposition III.5. Alg. 2 returns in finite time with $\theta^*$ locally
maximizing the probability of the RSP $\mu_\theta$ satisfying the LTL
formula $\phi$.

Proof. In [19], we have shown that the actor-critic algorithm
used in this paper returns in finite time with a locally optimal
$\theta^*$ such that $\|\nabla\alpha(\theta^*)\| \le \epsilon$ for a given $\epsilon$. We have shown
throughout the paper that the optimal policy maximizing the
probability of reaching $S^*_{\mathcal{P}}$ on $\mathcal{P}$ is a policy maximizing
the probability of satisfying $\phi$. We also showed throughout
the paper that the SSP problem, as well as the RSP $\mu_\theta$, can
be constructed without the transition probabilities, and only
with $\mathcal{M}^N$. Therefore, Alg. 2 produces an RSP maximizing
the probability of satisfying $\phi$ with respect to $\theta$ up to a
threshold $\epsilon$.

Algorithm 2 Overall algorithm providing a solution to Prob. II.4

Input: A labeled NTS $\mathcal{M}^N = (Q, q_0, U, A, P^N, \Pi, h)$ modeling
a robot in a partitioned environment, LTL formula $\phi$ over $\Pi$,
and a simulator to compute $P(q, u, \cdot)$ given a state-action pair $(q, u)$
1: Translate the LTL formula $\phi$ to a DRA $\mathcal{R}_\phi$
2: Generate the product NTS $\mathcal{P}^N = \mathcal{M}^N \times \mathcal{R}_\phi$
3: Find the union $S^*_{\mathcal{P}}$ of all AMECs associated with $\mathcal{P}^N$
4: Convert from an MRP to an SSP and generate $\tilde{\mathcal{P}}^N$
5: Obtain the RSP $\mu_\theta$ with $\tilde{\mathcal{P}}^N$
6: Execute Alg. 1 with $\tilde{\mathcal{P}}^N$ and $\mu_\theta$ as inputs until $\|\nabla\alpha(\theta^*)\| \le \epsilon$
for a $\theta^*$ and a given $\epsilon$
Output: RSP $\mu_\theta$ and $\theta^*$ locally maximizing the probability of
satisfying $\phi$ with respect to $\theta$ up to a threshold $\epsilon$



IV. Hardware-in-the-loop simulation 

We test the algorithms proposed in this paper through 
hardware-in-the-loop simulation for the RIDE environment 
(as shown in Fig. 1). The transition probabilities are computed
by an accurate simulator of RIDE as needed. We apply
both LTL control synthesis methods of linear programming 
(exact solution) and actor-critic (approximate solution) and 
compare the results. 

A. Environment 

In this case study, we consider an environment whose
topology is shown in Fig. 3. This environment is made of
square blocks forming 164 corridors and 84 intersections.
The corridors ($C_1, C_2, \ldots, C_{164}$), shown as white regions in
Fig. 3, are of three different lengths: one-, two-, and three-unit
lengths. The three-unit corridors are used to build corners
in the environment. The intersections ($I_1, I_2, \ldots, I_{84}$) are of
two types, three-way and four-way, and are shown as grey
blocks in Fig. 3. The black regions in this figure represent
the walls of the environment. Note that there is always a
corridor between two intersections.

There are five properties of interest (observations) associ- 
ated with the regions of the environment. These properties 
are: VD = ValuableData (regions containing valuable data to 
be collected), RD = RegularData (regions containing regular 
data to be collected), Up = Upload (regions where data can
be uploaded), Ri = Risky (regions that could pose a threat to 
the robot), and Un = Unsafe (regions that are unsafe for the 
robot). 

B. Construction of the MDP model 

The robot is equipped with a set of feedback control 
primitives (actions) - FollowRoad, GoRight, GoLeft, and 
GoStraight. The controller FollowRoad is only available 
(enabled) at the corridors. At four-way intersections, the available
controllers are GoRight, GoLeft, and GoStraight. At three-way
intersections, depending on the shape of the intersection,
two of the four controllers are available. Due to the presence
of noise in the actuators and sensors, however, the resulting 




Fig. 3. Schematic representation of tlie environment with 84 intersections 
and 164 conidors. The black blocks represent walls, and the grey and 
white regions are intersection and coiTidors, respectively. There are five 
properties of interest in the regions indicated with VD = ValuableData, RD 
= RegularData, Up = Upload, Ri = Risky, and Un = Unsafe. The initial 
position of the robot is shown with a blue disk and the upload region is 
indicated with a red star 

motion may be different than intended. Thus, the outcome 
of each control primitive is characterized probabiHstically. 

To create an MDP model of the robot in RIDE, we define
each state of the MDP as a collection of two adjacent regions
(a corridor and an intersection). For instance, the pairs
$C_1$-$I_2$ and $I_3$-$C_4$ are two states of the MDP. Through
this pairing of regions, it was shown that the Markov property
(i.e., the result of an action at a state depends only on the
current state) can be achieved [14]. The resulting MDP has 608
states.

The set of actions available at a state is the set of
controllers available at the last region corresponding to the
state. For example, when in state $C_1$-$I_2$, only those actions
from region $I_2$ are allowed. Each state of the MDP whose
second region satisfies an observation in $\Pi$ is mapped to that
observation.
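
The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout, the controllers assigned to the three-way intersection I3, and the label placed on I2 are all hypothetical and are not taken from the actual map or implementation.

```python
from typing import Dict, FrozenSet, List, Tuple

# A state is an ordered pair (first region, second region), e.g. ("C1", "I2").
State = Tuple[str, str]

# Controller names come from the text; the per-region tables are hypothetical.
CORRIDOR_ACTIONS = ["FollowRoad"]
FOUR_WAY_ACTIONS = ["GoRight", "GoLeft", "GoStraight"]


def available_actions(state: State,
                      region_actions: Dict[str, List[str]]) -> List[str]:
    """Actions enabled at a state are those of its last (second) region."""
    _, last_region = state
    return region_actions[last_region]


def observations(state: State,
                 region_labels: Dict[str, FrozenSet[str]]) -> FrozenSet[str]:
    """A state inherits the observations satisfied by its second region."""
    return region_labels.get(state[1], frozenset())


# Hypothetical fragment of a map: I2 is four-way, I3 is three-way.
region_actions = {
    "C1": CORRIDOR_ACTIONS,
    "C4": CORRIDOR_ACTIONS,
    "I2": FOUR_WAY_ACTIONS,
    "I3": ["GoLeft", "GoStraight"],  # assumed shape of this intersection
}
region_labels = {"I2": frozenset({"Up"})}  # assumed label, for illustration

actions_c1_i2 = available_actions(("C1", "I2"), region_actions)
labels_c1_i2 = observations(("C1", "I2"), region_labels)
labels_i3_c4 = observations(("I3", "C4"), region_labels)
```

Keeping the action and label tables indexed by region, rather than by state, mirrors the fact that both are properties of the second region only.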

To obtain transition probabilities, we use an accurate
simulator (see Fig. 4) incorporating the motion and sensing
of an iRobot Create platform with a Hokuyo URG-04LX
laser range finder, APSX RW-210 RFID reader, and an MSI
Wind U100-420US netbook (the robot is shown in Fig. 1)
in RIDE. Specifically, it emulates experimentally measured
response times, sensing and control errors, and noise levels
and distributions in the laser scanner readings. More details
on the software implementation of the simulator can be
found in [14]. We perform a total of 1,000 simulations for
each action available in each MDP state.
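
A minimal sketch of how transition probabilities could be estimated from repeated simulation runs is given below, assuming the estimate is the relative frequency of each observed next state. The function `estimate_transitions` and the stand-in `toy_simulator` are hypothetical; the toy simulator and its 0.9 success rate are invented for illustration and do not represent the RIDE simulator.

```python
import random
from collections import Counter
from typing import Callable, Dict, Hashable


def estimate_transitions(simulate: Callable[[Hashable, str, random.Random], Hashable],
                         state: Hashable,
                         action: str,
                         n_runs: int = 1000,
                         rng: random.Random = None) -> Dict[Hashable, float]:
    """Estimate P(next_state | state, action) as a relative frequency."""
    rng = rng or random.Random(0)
    counts = Counter(simulate(state, action, rng) for _ in range(n_runs))
    return {next_state: c / n_runs for next_state, c in counts.items()}


def toy_simulator(state: Hashable, action: str, rng: random.Random) -> str:
    """Stand-in simulator (assumption): the action succeeds 90% of the time."""
    return "intended" if rng.random() < 0.9 else "drift"


probs = estimate_transitions(toy_simulator, ("C1", "I2"), "GoLeft",
                             n_runs=1000, rng=random.Random(0))
```

With 1,000 runs per state-action pair, the standard error of an estimated probability p is roughly sqrt(p(1-p)/1000), which is below 0.016 for any p.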

C. Task specification and results 

We consider the following mission task:

Specification: Reach a location with ValuableData (VD) or
RegularData (RD), and then reach Upload (Up). Do not
reach Risky (Ri) regions unless a location with ValuableData
(VD) is eventually reached. Always avoid Unsafe (Un) regions
until Upload (Up) is reached (and the mission is completed).

Fig. 4. Simulation snapshots. The white disk represents the robot and
the different circles around it indicate different "zones" in which different
controllers are activated. The yellow dots represent the laser readings used
to define the target angle. (a) The robot centers itself on a stretch of corridor
by using FollowRoad; (b) The robot applies GoRight in an intersection;
(c) The robot applies GoLeft.

Fig. 5. The optimal solution (the maximal probability of satisfying the
specification) is shown with the dashed line, and the solid line represents
the exact reachability probability for the RSP as a function of the number
of iterations of the proposed algorithm.

The above task specification can be translated to the LTL 
formula: 

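$$\phi := \mathbf{F}\,\mathrm{Up} \wedge (\neg \mathrm{Un}\; \mathcal{U}\; \mathrm{Up}) \wedge \mathbf{G}(\mathrm{Ri} \rightarrow \mathbf{F}\,\mathrm{VD}) \wedge \mathbf{G}(\mathrm{VD} \vee \mathrm{RD} \rightarrow \mathbf{X}\,\mathbf{F}\,\mathrm{Up}) \tag{7}$$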
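
To make the four conjuncts of formula (7) concrete, the following sketch checks them on a finite sequence of observation sets. This is an illustration only: the finite-trace reading (with a strong "next" operator) is an assumption introduced here, whereas the paper evaluates the formula over infinite runs through a DRA. The function name `satisfies_spec` is hypothetical.

```python
from typing import List, Set

Trace = List[Set[str]]  # one set of observations per visited state


def satisfies_spec(trace: Trace) -> bool:
    """Check formula (7) on a finite trace (finite-trace semantics assumed)."""
    n = len(trace)

    # F Up: Upload is reached at some position.
    up_positions = [i for i, obs in enumerate(trace) if "Up" in obs]
    if not up_positions:
        return False

    # (not Un) U Up: no Unsafe region strictly before the first Upload.
    first_up = up_positions[0]
    if any("Un" in trace[j] for j in range(first_up)):
        return False

    for i, obs in enumerate(trace):
        # G(Ri -> F VD): every Risky visit is followed (now or later) by VD.
        if "Ri" in obs and not any("VD" in trace[k] for k in range(i, n)):
            return False
        # G(VD or RD -> X F Up): every data visit is strictly followed by Up.
        if ("VD" in obs or "RD" in obs) and not any(
                "Up" in trace[k] for k in range(i + 1, n)):
            return False
    return True


ok_run = satisfies_spec([{"RD"}, set(), {"Up"}])
unsafe_run = satisfies_spec([{"Un"}, {"RD"}, {"Up"}])
risky_run = satisfies_spec([{"Ri"}, {"RD"}, {"Up"}])
```

Here `ok_run` is true, while `unsafe_run` and `risky_run` are false: the second trace visits an Unsafe region before Upload, and the third visits a Risky region without ever reaching ValuableData.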

The initial position of the robot is shown as a blue circle
in Fig. 3, with the orientation towards the neighboring
intersection. We used the computational frameworks described
in this paper to find the control strategy maximizing the
probability of satisfying the specification. The DRA has 17
states, which results in a product MDP with 608 × 17 = 10336
states. By applying both linear programming (exact solution)
and actor-critic (approximate solution), we found maximum
probabilities of satisfying the specification of 92% and 75%,
respectively. The convergence of the actor-critic solution is
shown in Fig. 5. The parameters for this example are
$\lambda = 0.9$ and the initial $\theta = [5, -0.5]^T$. The
look-ahead window t for the RSP is 2.

It should be emphasized that we only compute the
transition probabilities along the sample path. Thus, when
Alg. 2 is completed (at iteration 1100), at most 1100 transition
probabilities of state-action pairs have been computed. In
comparison, in order to compute the probability exactly, around
30000 transition probabilities of state-action pairs must be
computed.
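
The on-demand computation described above can be sketched as a cache keyed by state-action pair. This is a sketch under stated assumptions rather than the implementation used in the experiments: the class `LazyTransitionModel` and the stand-in `toy_compute` are hypothetical names, and the three-step path is invented for illustration.

```python
from typing import Callable, Dict, Hashable, Tuple

Distribution = Dict[Hashable, float]


class LazyTransitionModel:
    """Computes transition probabilities only for pairs that are queried."""

    def __init__(self, compute: Callable[[Hashable, str], Distribution]):
        self._compute = compute  # expensive call, e.g. a batch of simulations
        self._cache: Dict[Tuple[Hashable, str], Distribution] = {}

    def probabilities(self, state: Hashable, action: str) -> Distribution:
        key = (state, action)
        if key not in self._cache:
            self._cache[key] = self._compute(state, action)
        return self._cache[key]

    @property
    def num_computed(self) -> int:
        """Number of distinct state-action pairs computed so far."""
        return len(self._cache)


def toy_compute(state: int, action: str) -> Distribution:
    """Stand-in for the simulator (assumption): deterministic step forward."""
    return {state + 1: 1.0}


model = LazyTransitionModel(toy_compute)
sample_path = [(0, "a"), (1, "a"), (0, "a")]  # the pair (0, "a") repeats
for s, a in sample_path:
    model.probabilities(s, a)
```

Because each iteration queries one state-action pair and repeated pairs hit the cache, the number of computed entries never exceeds the number of iterations, which is the bound stated in the text.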

V. Conclusions 

We presented a framework that brings together an
approximate dynamic programming computational method of the
actor-critic type with formal control synthesis for Markov
Decision Processes (MDPs) from temporal logic specifications.
We show that this approach is particularly suitable
for problems where the transition probabilities of the MDP
are difficult or computationally expensive to compute, as is
the case in many robotic applications. We show that this
approach effectively finds an approximately optimal policy within
a class of randomized stationary policies maximizing the
probability of satisfying the temporal logic formula. Future
directions include extending this result to multi-robot teams,
examining exactly how to choose an appropriate look-ahead
window when designing the RSP, and applying the result
to more realistic problem settings with the MDP containing
possibly millions of states.